E-CAST: A Data Mining Algorithm for Gene Expression Data

Authors

  • Abdelghani Bellaachia
  • David Portnoy
  • Yidong Chen
  • Abdel G. Elkahloun
Abstract

Data clustering has proven to be a successful data mining technique for the analysis of gene expression data. The cluster affinity search technique (CAST), developed by Ben-Dor et al. (1999), has been shown to cluster gene expression data well, but it has two drawbacks. First, the algorithm uses a fixed initial threshold value to start the clustering. As stated in the original paper, this parameter directly affects the size and number of clusters produced. Second, the algorithm requires a final cleaning step, which takes O(n), to relocate n data points among the existing clusters. In this paper, we develop an enhanced CAST algorithm, called E-CAST, that uses a dynamic threshold. The threshold value is computed at the beginning of each new cluster. We implemented both the CAST and E-CAST algorithms and tested their performance on three data sets. The data sets are real gene expression data from melanoma, pheochromocytoma, and brain cell tissue samples, generated using microarray technology. The output of both implementations was compared with that of the hierarchical clustering program written by Michael Eisen, and the results were very comparable. Not only did the final results compare favorably with the hierarchical approach, but they also indicate that the cleaning step of the original CAST algorithm may be unnecessary.
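The idea of recomputing the threshold at the start of each new cluster can be illustrated with a minimal Python sketch of a CAST-style add/remove loop. The abstract does not give the E-CAST threshold formula, so the rule used here (the mean pairwise similarity of the points still unassigned) is purely an illustrative assumption, as are the function names `mean_pairwise_similarity`, `affinity`, and `cast_dynamic`, the seeding rule, and the `max_steps` guard. The sketch omits the final cleaning step, in line with the abstract's observation that it may be unnecessary.

```python
def mean_pairwise_similarity(sim, points):
    """Average similarity over all distinct pairs in `points` (needs >= 2 points)."""
    pairs = [(a, b) for i, a in enumerate(points) for b in points[i + 1:]]
    return sum(sim[a][b] for a, b in pairs) / len(pairs)


def affinity(sim, u, cluster):
    """Affinity of point u to a cluster: sum of its similarities to the members."""
    return sum(sim[u][v] for v in cluster)


def cast_dynamic(sim, max_steps=10_000):
    """CAST-style clustering with a threshold recomputed for every new cluster.

    `sim` is a symmetric similarity matrix (list of lists) with values in [0, 1].
    Returns a list of clusters, each a sorted list of point indices.
    """
    unassigned = set(range(len(sim)))
    clusters = []
    while unassigned:
        rest = sorted(unassigned)
        if len(rest) == 1:
            clusters.append(rest)
            break
        # ASSUMPTION: the dynamic threshold is the mean pairwise similarity of
        # the remaining points; the paper's actual rule is not in the abstract.
        t = mean_pairwise_similarity(sim, rest)
        # ASSUMPTION: seed the cluster with the most connected remaining point.
        seed = max(rest, key=lambda u: affinity(sim, u, rest))
        cluster = {seed}
        for _ in range(max_steps):  # guard against add/remove oscillation
            changed = False
            # ADD: pull in the outside point with the highest affinity.
            outside = [u for u in rest if u not in cluster]
            if outside:
                best = max(outside, key=lambda u: affinity(sim, u, cluster))
                if affinity(sim, best, cluster) >= t * len(cluster):
                    cluster.add(best)
                    changed = True
            # REMOVE: drop the member with the lowest affinity if it falls short.
            if len(cluster) > 1:
                worst = min(cluster, key=lambda u: affinity(sim, u, cluster))
                if affinity(sim, worst, cluster) < t * len(cluster):
                    cluster.remove(worst)
                    changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters


# Toy similarity matrix: points 0-2 form one tight group, points 3-4 another.
toy_sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
print(cast_dynamic(toy_sim))  # [[0, 1, 2], [3, 4]]
```

In this toy run the threshold is about 0.42 for the first cluster and 0.9 for the second, which shows how a per-cluster threshold adapts to the points that remain instead of relying on one fixed value chosen up front.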

Similar resources

Prediction of Acid Mine Drainage Generation Potential of A Copper Mine Tailings Using Gene Expression Programming-A Case Study

This work presents a quantitative prediction of the likely acid mine drainage (AMD) generation process in tailing particles from the Sarcheshmeh copper mine in the south of Iran. Four predictive relationships for the remaining pyrite fraction, remaining chalcopyrite fraction, sulfate concentration, and pH have been suggested by applying the gene expression programming (GEP) a...

Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine

DNA microarray gene expression data provide a wealth of information across thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and of differences between tumors. In this study we try to reduce the high-dimensional data by a statistical method to select valuable, high-impact genes as biomarkers, and then classify ovarian tumors based on gene expression data of...

A Fuzzy C-means Algorithm for Clustering Fuzzy Data and Its Application in Clustering Incomplete Data

The fuzzy c-means clustering algorithm is a useful tool for clustering; but it is convenient only for crisp complete data. In this article, an enhancement of the algorithm is proposed which is suitable for clustering trapezoidal fuzzy data. A linear ranking function is used to define a distance for trapezoidal fuzzy data. Then, as an application, a method based on the proposed algorithm is pres...

Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...

O-3: Drug Repositioning by Merging Gene Expression Data Analysis and Cheminformatics Target Prediction Approaches

The transcriptional responses of drug treatments, combined with a protein target prediction algorithm, were utilised to associate compounds to biological genomic space. This enabled us to predict efficacy of compounds in cMap and LINCS against 181 databases of diseases extracted from GEO. 18/30 of top drugs predicted for leukemia (e.g. Leflunomide and Etoposide) and breast cancer (e.g. Tamoxifen a...

Publication date: 2002